Minimax Optimal Estimation of Stability Under Distribution Shift
The performance of decision policies and prediction models often deteriorates
when applied to environments different from the ones seen during training. To
ensure reliable operation, we propose and analyze a measure of a system's
stability under distribution shift, defined as the smallest change in the
underlying environment that causes the system's performance to deteriorate
beyond a permissible threshold. In contrast to standard tail risk measures and
distributionally robust losses that require the specification of a plausible
magnitude of distribution shift, the stability measure is defined in terms of a
more intuitive quantity: the level of acceptable performance degradation. We
develop a minimax optimal estimator of stability and analyze its convergence
rate, which exhibits a fundamental phase shift behavior. Our characterization
of the minimax convergence rate shows that evaluating stability against large
performance degradation incurs a statistical cost. Empirically, we demonstrate
the practical utility of our stability framework by using it to compare system
designs on problems where robustness to distribution shift is critical.
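
The stability measure admits a simple plug-in illustration. The sketch below is only a rough illustration under assumptions not stated in the abstract: it takes KL divergence as the measure of distribution shift and estimates stability by reweighting the observed losses via exponential tilting; the paper's minimax-optimal estimator and its choice of divergence may differ.

    # Illustrative plug-in estimate of stability: the smallest KL divergence from
    # the empirical loss distribution to a reweighting whose mean loss reaches the
    # permissible degradation threshold. (Assumes KL as the shift measure; the
    # paper's estimator may differ.)
    import numpy as np
    from scipy.optimize import brentq

    def stability_kl(losses, threshold):
        losses = np.asarray(losses, dtype=float)
        if threshold <= losses.mean():
            return 0.0                      # performance already at the threshold
        if threshold >= losses.max():
            return float("inf")             # no reweighting of observed losses reaches it

        def tilted_gap(t):
            # Exponential tilting q_i proportional to exp(t * loss_i), numerically stabilized.
            w = np.exp(t * (losses - losses.max()))
            w = w / w.sum()
            return float(w @ losses) - threshold

        t_star = brentq(tilted_gap, 0.0, 1e4)   # tilt until the reweighted mean hits the threshold
        w = np.exp(t_star * (losses - losses.max()))
        w = w / w.sum()
        pos = w > 0
        # KL(q || uniform empirical distribution) for the tilted weights.
        return float(np.sum(w[pos] * np.log(w[pos] * losses.size)))

    rng = np.random.default_rng(0)
    test_losses = rng.uniform(0.0, 0.4, size=1000)   # toy test losses, mean around 0.2
    print(stability_kl(test_losses, threshold=0.35))

A larger degradation threshold requires a bigger shift before performance falls past it, so the returned divergence grows; this is the intuition behind evaluating stability at a chosen level of acceptable performance degradation.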
Diagnosing Model Performance Under Distribution Shift
Prediction models can perform poorly when deployed to target distributions
different from the training distribution. To understand these operational
failure modes, we develop a method, called DIstribution Shift DEcomposition
(DISDE), to attribute a drop in performance to different types of distribution
shifts. Our approach decomposes the performance drop into terms for 1) an
increase in harder but frequently seen examples from training, 2) changes in
the relationship between features and outcomes, and 3) poor performance on
examples infrequent or unseen during training. These terms are defined by
fixing a distribution on X while varying the conditional distribution of
Y | X between training and target, or by fixing the conditional distribution
of Y | X while varying the distribution on X. In order to do this, we
define a hypothetical distribution on X consisting of values common in both
training and target, over which it is easy to compare Y | X and thus
predictive performance. We estimate performance on this hypothetical
distribution via reweighting methods. Empirically, we show how our method can
1) inform potential modeling improvements across distribution shifts for
employment prediction on tabular census data, and 2) help to explain why
certain domain adaptation methods fail to improve model performance for
satellite image classification.
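
One ingredient of the approach, estimating predictive performance on X values common to both training and target via reweighting, can be sketched with a domain classifier. The helper below is a hypothetical stand-in: it weights training examples by the estimated odds of coming from the target domain; the paper's construction of the shared distribution and its weighting method may differ.

    # Illustrative reweighting in the spirit of DISDE (hypothetical helper; the
    # paper's exact weighting scheme may differ). A classifier separating training
    # from target X yields density-ratio-style weights, and the reweighted average
    # loss approximates performance on X values common to both distributions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def reweighted_loss(X_train, loss_train, X_target):
        X = np.vstack([X_train, X_target])
        d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])
        clf = LogisticRegression(max_iter=1000).fit(X, d)
        p_target = clf.predict_proba(X_train)[:, 1]          # P(target | x) on training points
        w = p_target / np.clip(1.0 - p_target, 1e-6, None)   # odds of target membership
        return float(np.average(loss_train, weights=w))

    rng = np.random.default_rng(0)
    X_tr = rng.normal(0.0, 1.0, size=(2000, 3))
    X_tg = rng.normal(0.5, 1.0, size=(2000, 3))              # shifted covariates
    loss_tr = (X_tr[:, 0] > 1.0).astype(float)               # toy per-example loss
    print(loss_tr.mean(), reweighted_loss(X_tr, loss_tr, X_tg))

Comparing the unweighted and reweighted averages hints at how much of a performance gap comes from the covariate distribution shifting toward regions where the model does worse, as opposed to changes in the feature-outcome relationship.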
Modeling Interference Using Experiment Roll-out
Experiments on online marketplaces and social networks suffer from
interference, where the outcome of a unit is impacted by the treatment status
of other units. We propose a framework for modeling interference using a
ubiquitous deployment mechanism for experiments, staggered roll-out designs,
which slowly increase the fraction of units exposed to the treatment to
mitigate any unanticipated adverse side effects. Our main idea is to leverage
the temporal variations in treatment assignments introduced by roll-outs to
model the interference structure. We first present a set of model
identification conditions under which the estimation of common estimands is
possible and show how these conditions are aided by roll-out designs. Since
there are often multiple competing models of interference in practice, we then
develop a model selection method that evaluates models based on their ability
to explain outcome variation observed along the roll-out. Through simulations,
we show that our heuristic model selection method, Leave-One-Period-Out,
outperforms other baselines. We conclude with a set of considerations,
robustness checks, and potential limitations for practitioners wishing to use
our framework.
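
The Leave-One-Period-Out heuristic can be viewed as hold-one-out evaluation over roll-out periods. The sketch below assumes a hypothetical model interface (fit on a list of periods, predict per-period outcomes) and uses squared error as the score; the paper's exact procedure and error criterion may differ.

    # Illustrative Leave-One-Period-Out model selection. fit/predict and the
    # per-period data layout are hypothetical stand-ins for candidate interference
    # models; the score is mean squared error on the held-out period's outcomes.
    import numpy as np

    def leave_one_period_out(models, periods):
        """models: dict of name -> candidate interference model.
        periods: list of per-period observations, each a dict with an 'outcomes' array."""
        scores = {}
        for name, model in models.items():
            errors = []
            for t in range(len(periods)):
                held_out = periods[t]
                model.fit(periods[:t] + periods[t + 1:])    # fit on all other periods
                predicted = model.predict(held_out)
                errors.append(np.mean((predicted - held_out["outcomes"]) ** 2))
            scores[name] = float(np.mean(errors))
        best = min(scores, key=scores.get)
        return best, scores

The candidate that best predicts the outcome variation observed along the roll-out, period by period, is selected; models that mis-specify the interference structure tend to accumulate larger held-out errors.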